HHMM-based Chinese Lexical Analyzer ICTCLAS

Authors

  • Huaping Zhang
  • Hongkui Yu
  • Deyi Xiong
  • Qun Liu
Abstract

This paper presents the results of the Institute of Computing Technology, CAS in the First International Chinese Word Segmentation Bakeoff, sponsored by ACL SIGHAN. We introduce the unified HHMM-based framework of our Chinese lexical analyzer, ICTCLAS, and explain its operation in the six tracks. We then provide the evaluation results with further analysis. The evaluation shows that ICTCLAS is competitive: compared with the other systems, it ranked first in both the CTB and PK closed tracks and second in the PK open track. The BIG5 version of ICTCLAS was converted from the GB version in only two days, yet it performed well in the two BIG5 closed tracks. Through this first bakeoff, we learned more about developments in Chinese word segmentation and became more confident in our HHMM-based approach. At the same time, the evaluation exposed real problems in our system. We found the bakeoff interesting and helpful.

Similar resources

Chinese Lexical Analysis Using Hierarchical Hidden Markov Model

This paper presents a unified approach for Chinese lexical analysis using a hierarchical hidden Markov model (HHMM), which aims to incorporate Chinese word segmentation, part-of-speech tagging, disambiguation and unknown word recognition into a single theoretical framework. A class-based HMM is applied in word segmentation, and in this level unknown words are treated in the same way as common words l...

The ICT,CAS MT Systems for the IWSLT09 Evaluation

We only used the data provided by the organizer for each task. We first used the Chinese lexical analysis system ICTCLAS to split Chinese character sequences into words and a rule-based tokenizer to tokenize English sentences. Then, we converted all alphanumeric characters to their 2-byte representation. Finally, we ran GIZA++ and used the “grow-diag-final” heuristic to get many-to-many word alignmen...

Chinese Named Entity Recognition Using Role Model

This paper presents a stochastic model to tackle the problem of Chinese named entity recognition. In this research, we unify the component tokens of named entities and their contexts into a generalized role set, which is similar to a part-of-speech (POS) tag set. The probabilities of role emission and transition are acquired after machine learning on a role-labeled data set, which is transformed from a hand-correcte...

Chinese Word Segmentation Based On Direct Maximum Entropy Model

Chinese word segmentation is a fundamental and important issue in Chinese information processing. In order to find a unified approach for Chinese word segmentation, the author develops a Chinese lexical analyzer, PCWS, using a direct maximum entropy model. The paper presents a general description of PCWS, as well as the results and analysis of its performance at the Second International Chinese Wor...

A Hybrid Approach to Chinese-English Machine Translation

A hybrid method for Chinese-English machine translation is presented, in which rule-based analysis is combined with statistical data. The rule-based lexical analyzer and syntactic analyzer leave some amount of ambiguity, which is resolved using a statistical approach. A Hidden Markov Model (HMM) is used to return a score for each part of speech, and an improved probabilistic context-free grammar (PCFG) is used for ...

Publication date: 2003